{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 正则表达式\n",
    "\n",
    "字符串是编程时涉及到的最多的一种数据结构，对字符串进行操作的需求几乎无处不在。比如判断一个字符串是否是合法的Email地址，虽然可以编程提取@前后的子串，再分别判断是否是单词和域名，但这样做不但麻烦，而且代码难以复用。\n",
    "\n",
    "正则表达式是一种用来匹配字符串的强有力的武器。它的设计思想是用一种描述性的语言来给字符串定义一个规则，凡是符合规则的字符串，我们就认为它“匹配”了，否则，该字符串就是不合法的。\n",
    "\n",
    "所以我们判断一个字符串是否是合法的Email的方法是：\n",
    "\n",
    "创建一个匹配Email的正则表达式；\n",
    "\n",
    "用该正则表达式去匹配用户的输入来判断是否合法。\n",
    "\n",
    "因为正则表达式也是用字符串表示的，所以，我们要首先了解如何用字符来描述字符。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### re.match函数\n",
    "\n",
    "函数语法：\n",
    "\n",
    "```re.match(pattern, string, flags=0)```\n",
    "\n",
    "pattern\t匹配的正则表达式\n",
    "\n",
    "string\t要匹配的字符串。\n",
    "\n",
    "flags\t标志位，用于控制正则表达式的匹配方式，如：是否区分大小写，多行匹配等等。参见：正则表达式修饰符 - 可选标志"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#在起始位置匹配\n",
    "print(re.match('www', 'www.163.com').span())\n",
    "print(re.match('163', 'www.163.com'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "line = \"Cats Are smarter than dogs\"\n",
    "\n",
    "matchObj = re.match( r'(.*) are (.*?) .*', line, re.I)\n",
    " \n",
    "if matchObj:\n",
    "    print(\"matchObj.group() : \", matchObj.group())\n",
    "    print(\"matchObj.group(1) : \", matchObj.group(1))\n",
    "    print(\"matchObj.group(2) : \", matchObj.group(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. re.I\t 使匹配对大小写不敏感\n",
    "2. re.L\t 做本地化识别（locale-aware）匹配\n",
    "3. re.M\t 多行匹配，影响 ^ 和 $\n",
    "4. re.S\t 使 . 匹配包括换行在内的所有字符\n",
    "5. re.U\t 根据Unicode字符集解析字符。这个标志影响 \\w, \\W, \\b, \\B.\n",
    "6. re.X\t 该标志通过给予你更灵活的格式以便你将正则表达式写得更易于理解。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### re.search方法\n",
    "\n",
    "re.search 扫描整个字符串并返回第一个成功的匹配。\n",
    "\n",
    "函数语法：\n",
    "\n",
    "```re.search(pattern, string, flags=0)```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(re.search('www', 'www.163.com').span())  # 在起始位置匹配\n",
    "print(re.search('163', 'www.163.com').span())         # 不在起始位置匹配"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "line = \"Cats Are smarter than dogs\"\n",
    "\n",
    "matchObj = re.search( r'(.*) are (.*?) .*', line, re.I)\n",
    " \n",
    "if matchObj:\n",
    "    print(\"matchObj.group() : \", matchObj.group())\n",
    "    print(\"matchObj.group(1) : \", matchObj.group(1))\n",
    "    print(\"matchObj.group(2) : \", matchObj.group(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### re.match与re.search的区别\n",
    "re.match只匹配字符串的开始，如果字符串开始不符合正则表达式，则匹配失败，函数返回None；而re.search匹配整个字符串，直到找到一个匹配。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 练习，找出“dogs”\n",
    "line = \"Cats are smarter than dogs\";\n",
    " \n",
    "matchObj = re.???( r'dogs', line, re.I)\n",
    "if matchObj:\n",
    "    print(\"match --> matchObj.group() : \", matchObj.group())\n",
    "else:\n",
    "    print(\"No match!!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 检索和替换\n",
    "Python 的 re 模块提供了re.sub用于替换字符串中的匹配项。\n",
    "\n",
    "语法：\n",
    "\n",
    "```re.sub(pattern, repl, string, count=0, flags=0)```\n",
    "\n",
    "- pattern : 正则中的模式字符串。\n",
    "- repl : 替换的字符串，也可为一个函数。\n",
    "- string : 要被查找替换的原始字符串。\n",
    "- count : 模式匹配后替换的最大次数，默认 0 表示替换所有的匹配。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "phone = \"0757-86547-1548 # 这是一个电话号码\"\n",
    " \n",
    "# 删除字符串中的 Python注释 \n",
    "num = re.sub(r'#.*$', \"\", phone)\n",
    "print(\"电话号码是: \", num)\n",
    " \n",
    "# 删除非数字(-)的字符串 \n",
    "num = re.sub(r'\\D', \"\", phone)\n",
    "#num = re.sub(r'-', \"\", num)\n",
    "print(\"电话号码是 : \", num)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- \\d\t匹配一个Unicode数字，如果带re.ASCII，则匹配0-9\n",
    "- \\D 匹配Unicode非数字\n",
    "- \\s\t匹配Unicode空白，如果带有re.ASCII，则匹配\\t\\n\\r\\f\\v中的一个\n",
    "- \\S 匹配Unicode非空白\n",
    "- \\w\t匹配Unicode单词字符，如果带有re.ascii,则匹配[a-zA-Z0-9_]中的一个\n",
    "- \\W 匹配Unicode非单子字符"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将匹配的数字乘以 2\n",
    "def double(matched):\n",
    "    value = int(matched.group('value'))\n",
    "    return str(value * 2)\n",
    " \n",
    "s = 'A23G4HFD423'\n",
    "print(re.sub('(?P<value>\\d+)', double, s))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 分组匹配\n",
    "import re\n",
    "s = '110223199003060030'\n",
    "res = re.search('(?P<province>\\d{3})(?P<city>\\d{3})(?P<born_year>\\d{4})(?P<born_month>\\d{2})(?P<born_date>\\d{2})',s)\n",
    "print(res.groupdict())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 练习：不用正则表达方式实现同样功能"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### re.compile 函数\n",
    "compile 函数用于编译正则表达式，生成一个正则表达式（ Pattern ）对象，供 match() 和 search() 这两个函数使用。\n",
    "\n",
    "语法格式为：\n",
    "\n",
    "```re.compile(pattern[, flags])```\n",
    "\n",
    "- pattern : 一个字符串形式的正则表达式\n",
    "\n",
    "- flags : 可选，表示匹配模式，比如忽略大小写，多行模式等，具体参数为：\n",
    "\n",
    "- re.I 忽略大小写\n",
    "- re.L 表示特殊字符集 \\w, \\W, \\b, \\B, \\s, \\S 依赖于当前环境\n",
    "- re.M 多行模式\n",
    "- re.S 即为 . 并且包括换行符在内的任意字符（. 不包括换行符）\n",
    "- re.U 表示特殊字符集 \\w, \\W, \\b, \\B, \\d, \\D, \\s, \\S 依赖于 Unicode 字符属性数据库\n",
    "- re.X 为了增加可读性，忽略空格和 # 后面的注释"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(r'\\d+')\n",
    "m = pattern.match('one12twothree34four')\n",
    "print(m)\n",
    "m = pattern.match('one12twothree34four', 3, 10)\n",
    "print(m)\n",
    "print(m.group(0))\n",
    "print(m.start(0), m.end(0), m.span())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)\n",
    "m = pattern.match('Hello World Wide Web')\n",
    "print(m)\n",
    "print(m.group(0))  # 返回匹配成功的整个子串\n",
    "print(m.group(1))  # 返回第一个分组匹配成功的子串\n",
    "print(m.group(2))  # 返回第二个分组匹配成功的子串\n",
    "print(m.groups())  # 等价于 (m.group(1), m.group(2), ...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### findall\n",
    "在字符串中找到正则表达式所匹配的所有子串，并返回一个列表，如果没有找到匹配的，则返回空列表。\n",
    "\n",
    "注意： match 和 search 是匹配一次 findall 匹配所有。\n",
    "\n",
    "语法格式为：\n",
    "\n",
    "```findall(string[, pos[, endpos]])```\n",
    "\n",
    "参数：\n",
    "\n",
    "- string : 待匹配的字符串。\n",
    "- pos : 可选参数，指定字符串的起始位置，默认为 0。\n",
    "- endpos : 可选参数，指定字符串的结束位置，默认为字符串的长度。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(r'\\d+')\n",
    "m = pattern.findall('one12twothree34four')\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### re.split\n",
    "split 方法按照能够匹配的子串将字符串分割后返回列表，它的使用形式如下：\n",
    "\n",
    "```re.split(pattern, string[, maxsplit=0, flags=0])```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "res = re.split('\\W+', 'abx, 123sd, good.')\n",
    "print(res)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 练习不用正则表达，实现相同功能\n",
    "s =  'sd1xxx2aa2a3sd3xx12yy'\n",
    "res = re.split('\\d+', s)\n",
    "print(res)\n",
    "# 用一般方法..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 正则表达式模式\n",
    "模式字符串使用特殊的语法来表示一个正则表达式：\n",
    "\n",
    "字母和数字表示他们自身。一个正则表达式模式中的字母和数字匹配同样的字符串。\n",
    "\n",
    "多数字母和数字前加一个反斜杠时会拥有不同的含义。\n",
    "\n",
    "标点符号只有被转义时才匹配自身，否则它们表示特殊的含义。\n",
    "\n",
    "反斜杠本身需要使用反斜杠转义。\n",
    "\n",
    "由于正则表达式通常都包含反斜杠，所以你最好使用原始字符串来表示它们。模式元素(如 r'\\t'，等价于 '\\\\t')匹配相应的特殊字符。\n",
    "\n",
    "下表列出了正则表达式模式语法中的特殊元素。如果你使用模式的同时提供了可选的标志参数，某些模式元素的含义会改变。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 基本符号\n",
    "\n",
    "- ^  表示匹配字符串的开始位置  (例外  用在中括号中[ ] 时,可以理解为取反,表示不匹配括号中字符串)\n",
    "- $  表示匹配字符串的结束位置\n",
    "- *  表示匹配 零次到多次\n",
    "- +  表示匹配 一次到多次 (至少有一次)\n",
    "- ?  表示匹配零次或一次\n",
    "- .  表示匹配单个字符 \n",
    "- |  表示为或者,两项中取一项\n",
    "- (  ) 小括号表示匹配括号中全部字符\n",
    "- [  ] 中括号表示匹配括号中一个字符 范围描述 如[0-9 a-z A-Z]\n",
    "- {  } 大括号用于限定匹配次数  如 {n}表示匹配n个字符  {n,}表示至少匹配n个字符  {n,m}表示至少n,最多m\n",
    "- \\  转义字符 如上基本符号匹配都需要转义字符   如 \\*  表示匹配*号\n",
    "- \\w 表示英文字母和数字  \\W  非字母和数字\n",
    "- \\d  表示数字   \\D  非数字\n",
    "----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "常用的正则表达式(转)\n",
    "匹配中文字符的正则表达式： [\\u4e00-\\u9fa5]\n",
    "\n",
    "匹配双字节字符(包括汉字在内)：[^\\x00-\\xff]\n",
    "\n",
    "匹配空行的正则表达式：\\n[\\s| ]*\\r\n",
    "\n",
    "匹配HTML标记的正则表达式：/<(.*)>.*<\\/\\1>|<(.*) \\/>/ \n",
    "\n",
    "匹配首尾空格的正则表达式：(^\\s*)|(\\s*$)\n",
    "\n",
    "匹配IP地址的正则表达式：/(\\d+)\\.(\\d+)\\.(\\d+)\\.(\\d+)/g //\n",
    "\n",
    "匹配Email地址的正则表达式：\\w+([-+.]\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*\n",
    "\n",
    "匹配网址URL的正则表达式：http://(/[\\w-]+\\.)+[\\w-]+(/[\\w- ./?%&=]*)?\n",
    "\n",
    "sql语句：^(select|drop|delete|create|update|insert).*$\n",
    "\n",
    "1、非负整数：^\\d+$ \n",
    "\n",
    "2、正整数：^[0-9]*[1-9][0-9]*$ \n",
    "\n",
    "3、非正整数：^((-\\d+)|(0+))$ \n",
    "\n",
    "4、负整数：^-[0-9]*[1-9][0-9]*$ \n",
    "\n",
    "5、整数：^-?\\d+$ \n",
    "\n",
    "6、非负浮点数：^\\d+(\\.\\d+)?$ \n",
    "\n",
    "7、正浮点数：^((0-9)+\\.[0-9]*[1-9][0-9]*)|([0-9]*[1-9][0-9]*\\.[0-9]+)|([0-9]*[1-9][0-9]*))$ \n",
    "\n",
    "8、非正浮点数：^((-\\d+\\.\\d+)?)|(0+(\\.0+)?))$ \n",
    "\n",
    "9、负浮点数：^(-((正浮点数正则式)))$ \n",
    "\n",
    "10、英文字符串：^[A-Za-z]+$ \n",
    "\n",
    "11、英文大写串：^[A-Z]+$ \n",
    "\n",
    "12、英文小写串：^[a-z]+$ \n",
    "\n",
    "13、英文字符数字串：^[A-Za-z0-9]+$ \n",
    "\n",
    "14、英数字加下划线串：^\\w+$ \n",
    "\n",
    "15、E-mail地址：^[\\w-]+(\\.[\\w-]+)*@[\\w-]+(\\.[\\w-]+)+$ \n",
    "\n",
    "16、URL：^[a-zA-Z]+://(\\w+(-\\w+)*)(\\.(\\w+(-\\w+)*))*(\\?\\s*)?$ \n",
    "或：^http:\\/\\/[A-Za-z0-9]+\\.[A-Za-z0-9]+[\\/=\\?%\\-&_~`@[\\]\\':+!]*([^<>\\\"\\\"])*$\n",
    "\n",
    "17、邮政编码：^[1-9]\\d{5}$\n",
    "\n",
    "18、中文：^[\\u0391-\\uFFE5]+$\n",
    "\n",
    "19、电话号码：^((\\d2,3)|(\\d{3}\\-))?(0\\d2,3|0\\d{2,3}-)?[1-9]\\d{6,7}(\\-\\d{1,4})?$\n",
    "\n",
    "20、手机号码：^((\\d2,3)|(\\d{3}\\-))?13\\d{9}$\n",
    "\n",
    "21、双字节字符(包括汉字在内)：^\\x00-\\xff\n",
    "\n",
    "22、匹配首尾空格：(^\\s*)|(\\s*$)（像vbscript那样的trim函数）\n",
    "\n",
    "23、匹配HTML标记：<(.*)>.*<\\/\\1>|<(.*) \\/> \n",
    "\n",
    "24、匹配空行：\\n[\\s| ]*\\r\n",
    "\n",
    "25、提取信息中的网络链接：(h|H)(r|R)(e|E)(f|F) *= *('|\")?(\\w|\\\\|\\/|\\.)+('|\"| *|>)?\n",
    "\n",
    "26、提取信息中的邮件地址：\\w+([-+.]\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*\n",
    "\n",
    "27、提取信息中的图片链接：(s|S)(r|R)(c|C) *= *('|\")?(\\w|\\\\|\\/|\\.)+('|\"| *|>)?\n",
    "\n",
    "28、提取信息中的IP地址：(\\d+)\\.(\\d+)\\.(\\d+)\\.(\\d+)\n",
    "\n",
    "29、提取信息中的中国手机号码：(86)*0*13\\d{9}\n",
    "\n",
    "30、提取信息中的中国固定电话号码：(\\d3,4|\\d{3,4}-|\\s)?\\d{8}\n",
    "\n",
    "31、提取信息中的中国电话号码（包括移动和固定电话）：(\\d3,4|\\d{3,4}-|\\s)?\\d{7,14}\n",
    "\n",
    "32、提取信息中的中国邮政编码：[1-9]{1}(\\d+){5}\n",
    "\n",
    "33、提取信息中的浮点数（即小数）：(-?\\d*)\\.?\\d+\n",
    "\n",
    "34、提取信息中的任何数字 ：(-?\\d*)(\\.\\d+)? \n",
    "\n",
    "35、IP：(\\d+)\\.(\\d+)\\.(\\d+)\\.(\\d+)\n",
    "\n",
    "36、电话区号：/^0\\d{2,3}$/\n",
    "\n",
    "37、腾讯QQ号：^[1-9]*[1-9][0-9]*$\n",
    "\n",
    "38、帐号(字母开头，允许5-16字节，允许字母数字下划线)：^[a-zA-Z][a-zA-Z0-9_]{4,15}$\n",
    "\n",
    "39、中文、英文、数字及下划线：^[\\u4e00-\\u9fa5_a-zA-Z0-9]+$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
